NSF PAR Search | NSF Public Access Repository


Proximal Causal Inference with Text Data

Chen, Jacob M; Bhattacharya, Rohit; Keith, Katherine A (December 2024, Advances in Neural Information Processing Systems)

Recent text-based causal methods attempt to mitigate confounding bias by estimating proxies of confounding variables that are partially or imperfectly measured from unstructured text data. These approaches, however, assume analysts have supervised labels of the confounders given text for a subset of instances, a constraint that is sometimes infeasible due to data privacy or annotation costs. In this work, we address settings in which an important confounding variable is completely unobserved. We propose a new causal inference method that uses two instances of pre-treatment text data, infers two proxies using two zero-shot models on the separate instances, and applies these proxies in the proximal g-formula. We prove, under certain assumptions about the instances of text and accuracy of the zero-shot predictions, that our method of inferring text-based proxies satisfies identification conditions of the proximal g-formula while other seemingly reasonable proposals do not. To address untestable assumptions associated with our method and the proximal g-formula, we further propose an odds ratio falsification heuristic that flags when to proceed with downstream effect estimation using the inferred proxies. We evaluate our method in synthetic and semi-synthetic settings (the latter with real-world clinical notes from MIMIC-III and open large language models for zero-shot prediction) and find that our method produces estimates with low bias. We believe that this text-based design of proxies allows for the use of proximal causal inference in a wider range of scenarios, particularly those for which obtaining suitable proxies from structured data is difficult.
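The idea of adjusting for an unobserved confounder with two proxies can be illustrated with a small simulation. The sketch below is not the authors' estimator and does not involve text or zero-shot models: it assumes a fully linear-Gaussian system, in which the proximal g-formula reduces to a two-stage least squares procedure, and it generates the proxies directly instead of inferring them from text. All function names, coefficients, and the true effect of 2.0 are hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n, seed=0, effect=2.0):
    """Simulate a linear-Gaussian system with an unobserved confounder U.

    All coefficients here are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                          # unobserved confounder
    a = 0.8 * u + rng.normal(size=n)                # treatment, confounded by U
    z = u + rng.normal(size=n)                      # first proxy of U (no direct effect on Y)
    w = u + rng.normal(size=n)                      # second proxy of U (not affected by A or Z)
    y = effect * a + 1.5 * u + rng.normal(size=n)   # outcome
    return y, a, z, w


def ols(design, target):
    """Ordinary least squares coefficients for target ~ design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def naive_estimate(y, a):
    """Regress Y on A alone, ignoring the confounder (biased here)."""
    ones = np.ones_like(a)
    return ols(np.column_stack([ones, a]), y)[1]


def proximal_2sls(y, a, z, w):
    """Two-stage least squares using the two proxies.

    Stage 1 predicts W from (A, Z); stage 2 regresses Y on (A, predicted W).
    The coefficient on A is the effect estimate under the linear assumptions.
    """
    ones = np.ones_like(a)
    stage1 = np.column_stack([ones, a, z])
    w_hat = stage1 @ ols(stage1, w)
    stage2 = np.column_stack([ones, a, w_hat])
    return ols(stage2, y)[1]


if __name__ == "__main__":
    y, a, z, w = simulate(50_000, seed=0)
    print("naive estimate:   ", naive_estimate(y, a))
    print("proximal estimate:", proximal_2sls(y, a, z, w))
```

In this simulated setting the naive regression absorbs part of the confounder's influence into the treatment coefficient, while the two-stage estimate should land close to the simulated effect. With binary treatments or proxies, as in the paper's zero-shot setting, the linear shortcut would generally not apply.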
"Let Me Just Interrupt You": Estimating Gender Effects in Supreme Court Oral Arguments

Cai, Erica; Gupta, Ankita; Keith, Katherine; O'Connor, Brendan; Rice, Douglas R. (January 2023, SocArxiv)

Paying Attention to the Algorithm Behind the Curtain: Bringing Transparency to YouTube’s Demonetization Algorithms

Dunna, Arun; Keith, Katherine A.; Zuckerman, Ethan; Vallina-Rodriguez, Narseo; O'Connor, Brendan; Nithyanand, Rishab (January 2022, Proceedings of the 2022 ACM Conference on Computer Supported Cooperative Work)

Text as Causal Mediators: Research Design for Causal Estimates of Differential Treatment of Social Groups via Language Aspects

https://doi.org/10.18653/v1/2021.cinlp-1.2

Keith, Katherine; Rice, Douglas; O’Connor, Brendan (January 2021, First Workshop on Causal Inference & NLP (CI-NLP) at EMNLP 2021)

Corpus-Level Evaluation for Event QA: The IndiaPoliceEvents Corpus Covering the 2002 Gujarat Violence

https://doi.org/10.18653/v1/2021.findings-acl.371

Halterman, Andrew; Keith, Katherine; Sarwar, Sheikh; O’Connor, Brendan (January 2021, Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021)
Automated event extraction in social science applications often requires corpus-level evaluations: for example, aggregating text predictions across metadata and unbiased estimates of recall. We combine corpus-level evaluation requirements with a real-world, social science setting and introduce the IndiaPoliceEvents corpus—all 21,391 sentences from 1,257 English-language Times of India articles about events in the state of Gujarat during March 2002. Our trained annotators read and label every document for mentions of police activity events, allowing for unbiased recall evaluations. In contrast to other datasets with structured event representations, we gather annotations by posing natural questions, and evaluate off-the-shelf models for three different tasks: sentence classification, document ranking, and temporal aggregation of target events. We present baseline results from zero-shot BERT-based models fine-tuned on natural language inference and passage retrieval tasks. Our novel corpus-level evaluations and annotation approach can guide creation of similar social-science-oriented resources in the future.
Text and Causal Inference: A Review of Using Text to Remove Confounding from Causal Estimates

https://doi.org/10.18653/v1/2020.acl-main.474

Keith, Katherine; Jensen, David; O’Connor, Brendan (January 2020, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics)

Many applications of computational social science aim to infer causal conclusions from non-experimental data. Such observational data often contains confounders, variables that influence both potential causes and potential effects. Unmeasured or latent confounders can bias causal estimates, and this has motivated interest in measuring potential confounders from observed text. For example, an individual’s entire history of social media posts or the content of a news article could provide a rich measurement of multiple confounders. Yet, methods and applications for this problem are scattered across different communities and evaluation practices are inconsistent. This review is the first to gather and categorize these examples and provide a guide to data-processing and evaluation decisions. Despite increased attention on adjusting for confounding using text, there are still many open problems, which we highlight in this paper.
Uncertainty over Uncertainty: Investigating the Assumptions, Annotations, and Text Measurements of Economic Policy Uncertainty

https://doi.org/10.18653/v1/2020.nlpcss-1.13

Keith, Katherine; Teichmann, Christoph; O'Connor, Brendan; Meij, Edgar (January 2020, Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science)
Methods and applications are inextricably linked in science, and in particular in the domain of text-as-data. In this paper, we examine one such text-as-data application, an established economic index that measures economic policy uncertainty from keyword occurrences in news. This index, which is shown to correlate with firm investment, employment, and excess market returns, has had substantive impact in both the private sector and academia. Yet, as we revisit and extend the original authors’ annotations and text measurements we find interesting text-as-data methodological research questions: (1) Are annotator disagreements a reflection of ambiguity in language? (2) Do alternative text measurements correlate with one another and with measures of external predictive validity? We find for this application (1) some annotator disagreements of economic policy uncertainty can be attributed to ambiguity in language, and (2) switching measurements from keyword-matching to supervised machine learning classifiers results in low correlation, a concerning implication for the validity of the index.
Modeling financial analysts’ decision making via the pragmatics and semantics of earnings calls

Keith, Katherine A; Stent, Amanda (January 2019, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics)

